\section{Introduction}

Twitter, a micro-blogging system that combines a social network with text content, has established itself as a leading source of breaking news and a platform for sharing opinions and interests. According to the Twitter Blog~\cite{twitterblog}, millions of users publish more than 200 million tweets on Twitter every day. Although successful social networking websites such as Facebook and MySpace existed before Twitter, Twitter became popular because of its simplicity: people find it easy to tweet and share information. Several characteristics differentiate Twitter from other social networks:

\begin{itemize}

\item \textbf{Non-Reciprocal Relationship: } On Twitter, users are connected by links called ``follow''. If user A follows user B, all tweets produced by B appear in A's timeline. However, a ``follow'' link does not require B's permission, and B is not required to follow back. A user can therefore freely follow anyone he/she is interested in. On Facebook or MySpace, by contrast, users are connected by ``friendship'', a reciprocal relationship that both users must agree to.

\item \textbf{Incomplete Profile: } Whereas Facebook provides a detailed profile covering personal information, family, education, work, and interests, Twitter provides only a short bio in which a user can describe him/herself. The bio is limited to 160 characters and has no structured format. Such a short bio makes it difficult to learn a user's profile, including affiliation, occupation, and interests, and thus makes it hard to build further services, such as recommendation, on top of it.

\item \textbf{User-generated Content as Plain Text: } While Facebook allows users to post different types of user-generated content, such as statuses, photos, videos, and external URLs, Twitter has only one form of user-generated content: plain text of at most 140 characters (a so-called tweet). Users can, however, insert shortened URLs of an image, a video, or a webpage into tweets. Having only text data makes users' tweets easy to analyze, but it also makes it hard to identify users' interests in a specific area, such as music or movies.

\end{itemize}

User profiles are an important factor in user modeling, since a profile is completed by the user and is often of good quality. Accurate user modeling enables many useful applications, such as recommendation systems, advertisement delivery, and information prioritization. However, as mentioned above, the short bio on Twitter makes it difficult to learn a user's real information. In our experiment, about 27.2\% of users have no self-written bio or a bio shorter than 5 characters, and about 43.9\% of users have a bio shorter than 10 characters, which indicates that almost half of the users do not have a meaningful bio describing themselves. To cope with this problem, we can use the structure of the social network as well as the content of users' tweets to extract the hidden profile of users who lack an informative bio, whether because they did not bother to write one or because they do not want to post such information. For example, if a user is studying at UCLA, even though he/she might not state this explicitly in his/her bio, we may make the following observations from his/her activity on Twitter:

\begin{enumerate}

\item In his/her social network, which includes the users he/she follows and the users who follow him/her, there might be a considerable number of users who explicitly indicate that they are students at UCLA.
\item In his/her tweets, he/she might post about what is happening at UCLA or about something else related to UCLA, from which we could infer that he/she is a student at UCLA. He/she could also retweet tweets containing such information.

\end{enumerate}

Based on the above observations, in this paper we explore models and approaches for extracting a user's hidden profile from whom they are connected to in the social network and what they tweet on Twitter. Specifically, given a social network graph, a category, and a set of seed users in that category, we want to find the users in the graph who are likely to belong to the category and to rank them by how likely they are to belong to it. Our approaches fall into two categories: 1) those exploiting only the link structure of the graph; 2) those exploiting only user-generated content, including bio, location, and tweets. After describing three models in these two categories, we also propose a co-training algorithm that combines the advantages of models from both categories, iteratively reinforcing the result of one algorithm with the results of the others to achieve better performance. With information from both perspectives, we achieve a top-50 precision of 87\% on the UCLA dataset.
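
As an informal sketch of this ranking task, consider the following notation, which is illustrative only and is not necessarily the notation used in Section~\ref{sec:problem}. Let $G = (V, E)$ be the directed follow graph, let $c$ be a category, and let $S \subseteq V$ be the seed users known to belong to $c$. A model assigns each user $u \in V$ a real-valued score $f_c(u)$, and users are ranked by decreasing score. Assuming a ground-truth set $U_c \subseteq V$ of users who actually belong to $c$, the standard top-$k$ precision of such a ranking is
\[
  \mathrm{P@}k = \frac{\left| \{u_1, \dots, u_k\} \cap U_c \right|}{k},
\]
where $u_1, \dots, u_k$ are the $k$ highest-scoring users. The top-50 precision mentioned above corresponds to $k = 50$.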

% User on Facebook has a complete profile by filling out his education, work and interest, while Twitter user can only fill out a piece of 160 characters bio information. It is difficult to precisely recommend related users, news and services on Twitter with limited user profile information. Thus, extracting a user's hidden profile is a valuable topic which improves user experience and advertisement revenue at the same time. Additionally, it will also help to improve the search result relevance by assigning a higher rank to the results in which the user is more interested.

% We will propose an algorithm to extract a user's hidden profile from his followers and followings using the closeness and tweets similarity of two users, which is based on the intuition that people with similar background and interests tend to connect with each other.

%However, proposing an efficient solution to the problem is very challenging due to sparseness and incompleteness of the following relation data. The Twitter network graph is a very large but spare graph which involves over a hundred million users, while most of them are connected to only hundreds of users. Additionally, some users have more than a thousand followers which cannot be completely retrieved due to the limit of Twitter API. Thus, our algorithm should be able to predict with sparse and incomplete following relation data.

The rest of the paper is organized as follows.
Section~\ref{sec:problem} presents the mathematical formulation of the problem. Section~\ref{sec:method} presents four solutions to this problem. Section~\ref{sec:experiment} reports the experimental results. Section~\ref{sec:discussion} discusses some issues discovered in the experiments. The paper concludes in Section~\ref{sec:conclusion}.

%Section \ref{sec:related-work} compares our work with related work.
